Prosper System means the proprietary, internet-based peer-to-peer lending platform developed by the Member and transferred to the Company, through which Borrowers may obtain Borrower Loans and qualified investors may purchase Securities.
Loan Listing means a request by a Borrower to borrow money on the terms of a Loan through the Prosper System.
This is a Loan Listing dataset which contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. The exploration here will only be focused on select variables of the dataset, for the purpose of this project, focus shall be restricted to about 15 selected dataset columns/variables out of the 81 count. Brief description of these variables are given as thus;
ListingNumber: The number that uniquely identifies the listing to the public as displayed on the website.
CreditScoreRangeLower: The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
CreditScoreRangeUpper: The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
BorrowerAPR: The Borrower's Annual Percentage Rate (APR) for the loan.
EmploymentStatus: The employment status of the borrower at the time they posted the listing.
AmountDelinquent: Dollars delinquent at the time the credit profile was pulled.
DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
IncomeRange: The income range of the borrower at the time the listing was created.
RevolvingCreditBalance: Dollars of revolving credit at the time the credit profile was pulled.
AvailableBankcardCredit: The total available credit via bank card at the time the credit profile was pulled.
OpenRevolvingMonthlyPayment: Monthly payment on revolving accounts at the time the credit profile was pulled.
LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
CurrentCreditLines: Number of current credit lines at the time the credit profile was pulled.
BankcardUtilization: The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Download prosper loan dataset and save file
url = 'https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv'
df = pd.read_csv(url)
df.to_csv('prosper_loan.csv', index = False)
# load in downloaded loan data
loans = pd.read_csv('prosper_loan.csv')
loans.sample(2)
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 88769 | B9693580039408978FBC88C | 793430 | 2013-06-01 08:11:58.057000000 | NaN | 60 | Current | NaN | 0.24282 | 0.2179 | 0.2079 | ... | -64.86 | 0.0 | 0.00 | 0.00 | 0.0 | 1.0 | 0 | 0 | 0.0 | 18 |
| 34450 | 92293415338318841EB92D9 | 289019 | 2008-03-05 19:53:24.800000000 | D | 36 | Defaulted | 2009-02-12 00:00:00 | 0.17677 | 0.1550 | 0.1450 | ... | -24.01 | 0.0 | 4339.79 | 4339.79 | 0.0 | 1.0 | 0 | 0 | 0.0 | 82 |
2 rows × 81 columns
# confirm dataset shape
loans.shape
(113937, 81)
loans.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
# make a copy of dataset to preserve the original
cp_loans=loans.copy()
# make a list of selected variables/columns from the dataset
select_variables = ['ListingNumber','ListingCategory (numeric)','CreditScoreRangeLower','CreditScoreRangeUpper','BorrowerAPR','EmploymentStatus',
'AmountDelinquent','DebtToIncomeRatio','IncomeRange','RevolvingCreditBalance','AvailableBankcardCredit',
'OpenRevolvingMonthlyPayment','LoanStatus','BankcardUtilization','CurrentCreditLines']
#make a dataframe from selected variables
select_df = cp_loans[select_variables]
select_df.sample(4)
| ListingNumber | ListingCategory (numeric) | CreditScoreRangeLower | CreditScoreRangeUpper | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77115 | 1008621 | 18 | 680.0 | 699.0 | 0.22601 | Employed | 0.0 | 0.18 | $75,000-99,999 | 5011.0 | 7989.0 | 103.0 | Current | 0.38 | 7.0 |
| 81161 | 687587 | 6 | 680.0 | 699.0 | 0.23277 | Employed | 0.0 | 0.46 | $25,000-49,999 | 37046.0 | 45223.0 | 1042.0 | Current | 0.42 | 24.0 |
| 75377 | 127575 | 0 | 640.0 | 659.0 | 0.14887 | Retired | 0.0 | 0.15 | $50,000-74,999 | 4488.0 | 1735.0 | 236.0 | Completed | 0.70 | 7.0 |
| 2067 | 852741 | 1 | 640.0 | 659.0 | 0.31032 | Employed | 671.0 | 0.24 | $25,000-49,999 | 5814.0 | 5078.0 | 254.0 | Past Due (1-15 days) | 0.47 | 8.0 |
len(select_variables)
15
# checking for null/missing values
select_df.isna().sum()
ListingNumber 0 ListingCategory (numeric) 0 CreditScoreRangeLower 591 CreditScoreRangeUpper 591 BorrowerAPR 25 EmploymentStatus 2255 AmountDelinquent 7622 DebtToIncomeRatio 8554 IncomeRange 0 RevolvingCreditBalance 7604 AvailableBankcardCredit 7544 OpenRevolvingMonthlyPayment 0 LoanStatus 0 BankcardUtilization 7604 CurrentCreditLines 7604 dtype: int64
select_df = select_df.dropna()
select_df.isna().sum()
ListingNumber 0 ListingCategory (numeric) 0 CreditScoreRangeLower 0 CreditScoreRangeUpper 0 BorrowerAPR 0 EmploymentStatus 0 AmountDelinquent 0 DebtToIncomeRatio 0 IncomeRange 0 RevolvingCreditBalance 0 AvailableBankcardCredit 0 OpenRevolvingMonthlyPayment 0 LoanStatus 0 BankcardUtilization 0 CurrentCreditLines 0 dtype: int64
#Review selected dataframe info
select_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 97903 entries, 0 to 113936 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingNumber 97903 non-null int64 1 ListingCategory (numeric) 97903 non-null int64 2 CreditScoreRangeLower 97903 non-null float64 3 CreditScoreRangeUpper 97903 non-null float64 4 BorrowerAPR 97903 non-null float64 5 EmploymentStatus 97903 non-null object 6 AmountDelinquent 97903 non-null float64 7 DebtToIncomeRatio 97903 non-null float64 8 IncomeRange 97903 non-null object 9 RevolvingCreditBalance 97903 non-null float64 10 AvailableBankcardCredit 97903 non-null float64 11 OpenRevolvingMonthlyPayment 97903 non-null float64 12 LoanStatus 97903 non-null object 13 BankcardUtilization 97903 non-null float64 14 CurrentCreditLines 97903 non-null float64 dtypes: float64(10), int64(2), object(3) memory usage: 12.0+ MB
# change data types of columns as required
select_df= select_df.astype({'CreditScoreRangeLower':'int64','CreditScoreRangeUpper':'int64','ListingNumber':'object'})
# check data types for each column
select_df.dtypes
ListingNumber object ListingCategory (numeric) int64 CreditScoreRangeLower int64 CreditScoreRangeUpper int64 BorrowerAPR float64 EmploymentStatus object AmountDelinquent float64 DebtToIncomeRatio float64 IncomeRange object RevolvingCreditBalance float64 AvailableBankcardCredit float64 OpenRevolvingMonthlyPayment float64 LoanStatus object BankcardUtilization float64 CurrentCreditLines float64 dtype: object
# create credit score range list from CreditScoreRangeLower and CreditScoreRangeUpper columnss
select_df['MedianScore'] = [int(np.median(np.arange(x,y+1))) for x,y in zip(select_df['CreditScoreRangeLower'],select_df['CreditScoreRangeUpper'])]
select_df.sample(2)
| ListingNumber | ListingCategory (numeric) | CreditScoreRangeLower | CreditScoreRangeUpper | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99952 | 496018 | 7 | 720 | 739 | 0.29510 | Other | 0.0 | 0.16 | $1-24,999 | 12.0 | 1788.0 | 50.0 | Completed | 0.00 | 5.0 | 729 |
| 10401 | 898052 | 1 | 680 | 699 | 0.29537 | Employed | 0.0 | 0.16 | $75,000-99,999 | 45001.0 | 1713.0 | 896.0 | Current | 0.95 | 15.0 | 689 |
# Apply FICO credit score ratings
#FICO_SCORE = {'Poor': [300,579],'Fair': [580,669],'Good': [670,739],'Very Good': [740,799],'Exceptional':[800,850]}
conditions = [
(select_df['MedianScore'] < 300) | (select_df['MedianScore']>=300) & (select_df['MedianScore']<580),
(select_df['MedianScore']>=580) & (select_df['MedianScore']<670),
(select_df['MedianScore']>=670) & (select_df['MedianScore']<740),
(select_df['MedianScore']>=740) & (select_df['MedianScore']<800),
(select_df['MedianScore']>=800) & (select_df['MedianScore']<851) | (select_df['MedianScore'] > 850)
]
values = ['Poor','Fair','Good','Very Good','Exceptional']
select_df['CreditRating'] = np.select(conditions,values,default= 'NA')
select_df.head()
| ListingNumber | ListingCategory (numeric) | CreditScoreRangeLower | CreditScoreRangeUpper | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 0 | 640 | 659 | 0.16516 | Self-employed | 472.0 | 0.17 | $25,000-49,999 | 0.0 | 1500.0 | 24.0 | Completed | 0.00 | 5.0 | 649 | Fair |
| 1 | 1209647 | 2 | 680 | 699 | 0.12016 | Employed | 0.0 | 0.18 | $50,000-74,999 | 3989.0 | 10266.0 | 389.0 | Current | 0.21 | 14.0 | 689 | Good |
| 3 | 658116 | 16 | 800 | 819 | 0.12528 | Employed | 10056.0 | 0.15 | $25,000-49,999 | 1444.0 | 30754.0 | 115.0 | Current | 0.04 | 5.0 | 809 | Exceptional |
| 4 | 909464 | 2 | 680 | 699 | 0.24614 | Employed | 0.0 | 0.26 | $100,000+ | 6193.0 | 695.0 | 220.0 | Current | 0.81 | 19.0 | 689 | Good |
| 5 | 1074836 | 1 | 740 | 759 | 0.15425 | Employed | 0.0 | 0.36 | $100,000+ | 62999.0 | 86509.0 | 1410.0 | Current | 0.39 | 21.0 | 749 | Very Good |
# create non numeric category column and drop the numeric category table
conditions = [
(select_df['ListingCategory (numeric)']== 0), (select_df['ListingCategory (numeric)']== 1),
(select_df['ListingCategory (numeric)']== 2), (select_df['ListingCategory (numeric)']== 3),
(select_df['ListingCategory (numeric)']== 4), (select_df['ListingCategory (numeric)']== 5),
(select_df['ListingCategory (numeric)']== 6), (select_df['ListingCategory (numeric)']== 7),
(select_df['ListingCategory (numeric)']== 8), (select_df['ListingCategory (numeric)']== 9),
(select_df['ListingCategory (numeric)']== 10), (select_df['ListingCategory (numeric)']== 11),
(select_df['ListingCategory (numeric)']== 12), (select_df['ListingCategory (numeric)']== 13),
(select_df['ListingCategory (numeric)']== 14), (select_df['ListingCategory (numeric)']== 15),
(select_df['ListingCategory (numeric)']== 16), (select_df['ListingCategory (numeric)']== 17),
(select_df['ListingCategory (numeric)']== 18), (select_df['ListingCategory (numeric)']== 19),
(select_df['ListingCategory (numeric)']== 20)
]
values = ['Not Available','Debt Consolidation','Home Improvement','Business','Personal Loan','Student Use',
'Auto','Other','Baby and Adoption','Boat','Cosmetic Procedure','Engagement Ring','Green Loans',
'Household Expenses','Large Purchases','Medical/Dental','Motorcycle','RV','Taxes','Vacation','Wedding Loans']
select_df['ListingCategory'] = np.select(conditions,values,default='Not in Category')
select_df.drop('ListingCategory (numeric)', axis =1, inplace = True)
select_df.head()
| ListingNumber | CreditScoreRangeLower | CreditScoreRangeUpper | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 640 | 659 | 0.16516 | Self-employed | 472.0 | 0.17 | $25,000-49,999 | 0.0 | 1500.0 | 24.0 | Completed | 0.00 | 5.0 | 649 | Fair | Not Available |
| 1 | 1209647 | 680 | 699 | 0.12016 | Employed | 0.0 | 0.18 | $50,000-74,999 | 3989.0 | 10266.0 | 389.0 | Current | 0.21 | 14.0 | 689 | Good | Home Improvement |
| 3 | 658116 | 800 | 819 | 0.12528 | Employed | 10056.0 | 0.15 | $25,000-49,999 | 1444.0 | 30754.0 | 115.0 | Current | 0.04 | 5.0 | 809 | Exceptional | Motorcycle |
| 4 | 909464 | 680 | 699 | 0.24614 | Employed | 0.0 | 0.26 | $100,000+ | 6193.0 | 695.0 | 220.0 | Current | 0.81 | 19.0 | 689 | Good | Home Improvement |
| 5 | 1074836 | 740 | 759 | 0.15425 | Employed | 0.0 | 0.36 | $100,000+ | 62999.0 | 86509.0 | 1410.0 | Current | 0.39 | 21.0 | 749 | Very Good | Debt Consolidation |
# Review descriptive statistics for dataset
select_df.describe()
| CreditScoreRangeLower | CreditScoreRangeUpper | BorrowerAPR | AmountDelinquent | DebtToIncomeRatio | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | BankcardUtilization | CurrentCreditLines | MedianScore | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 97903.000000 | 97903.000000 | 97903.000000 | 97903.000000 | 97903.000000 | 9.790300e+04 | 97903.000000 | 97903.000000 | 97903.000000 | 97903.000000 | 97903.000000 |
| mean | 690.370060 | 709.370060 | 0.217565 | 1003.242046 | 0.276037 | 1.730366e+04 | 10949.585559 | 409.758894 | 0.565739 | 10.383962 | 699.370060 |
| std | 56.743141 | 56.743141 | 0.080075 | 7154.343694 | 0.524521 | 3.198340e+04 | 19381.728405 | 437.843818 | 0.317582 | 5.457407 | 56.743141 |
| min | 520.000000 | 539.000000 | 0.006530 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 529.000000 |
| 25% | 660.000000 | 679.000000 | 0.155380 | 0.000000 | 0.150000 | 3.159000e+03 | 845.500000 | 131.000000 | 0.310000 | 7.000000 | 669.000000 |
| 50% | 680.000000 | 699.000000 | 0.207350 | 0.000000 | 0.220000 | 8.559000e+03 | 3990.000000 | 287.000000 | 0.610000 | 10.000000 | 689.000000 |
| 75% | 720.000000 | 739.000000 | 0.280320 | 0.000000 | 0.320000 | 1.933600e+04 | 12785.500000 | 539.000000 | 0.840000 | 13.000000 | 729.000000 |
| max | 880.000000 | 899.000000 | 0.423950 | 463881.000000 | 10.010000 | 1.435667e+06 | 572427.000000 | 14985.000000 | 5.950000 | 59.000000 | 889.000000 |
select_df.drop(['CreditScoreRangeLower','CreditScoreRangeUpper'], axis = 1, inplace=True)
select_df.head()
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 0.16516 | Self-employed | 472.0 | 0.17 | $25,000-49,999 | 0.0 | 1500.0 | 24.0 | Completed | 0.00 | 5.0 | 649 | Fair | Not Available |
| 1 | 1209647 | 0.12016 | Employed | 0.0 | 0.18 | $50,000-74,999 | 3989.0 | 10266.0 | 389.0 | Current | 0.21 | 14.0 | 689 | Good | Home Improvement |
| 3 | 658116 | 0.12528 | Employed | 10056.0 | 0.15 | $25,000-49,999 | 1444.0 | 30754.0 | 115.0 | Current | 0.04 | 5.0 | 809 | Exceptional | Motorcycle |
| 4 | 909464 | 0.24614 | Employed | 0.0 | 0.26 | $100,000+ | 6193.0 | 695.0 | 220.0 | Current | 0.81 | 19.0 | 689 | Good | Home Improvement |
| 5 | 1074836 | 0.15425 | Employed | 0.0 | 0.36 | $100,000+ | 62999.0 | 86509.0 | 1410.0 | Current | 0.39 | 21.0 | 749 | Very Good | Debt Consolidation |
select_df.shape
(97903, 15)
The original dataset had the shape (113937, 81), i.e. 113,937 loans with 81 attributes (columns). preliminary data wrangling restricted the shape to (96187, 16), 16 selected attributes (columns) from the original dataset (among which includes both discrete and continuous variables), few of which were eventualy dropped and 3 attributes (MedianScore, CreditRating, listingcategory) are outcomes from feature engineering.
Mean BorrowerAPR is 0.218123, mean DebtToIncomeRatio is 0.271943, and the mean MedianScore is 700.272625 with 'Good' credit rating.
Quite a number of features in the dataset will affect the credit profile of the borrower and influence performance. The most stricking features, among others, for borrowers are the credit scores, debt to income ratios, credit ratings, income range while the ones for the loans are Revolving Credit Balances, Annual Percentage Rates (APR) and Bank Card Utilization percentage, e.t.c.
What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶
Loan status will show whether loan is completed, current, charged-off or non performing. Credit ratings and income ranges will show if customer can repay credit.
How distrubuted is borrowing accross:
IncomeRange
EmploymentStatus
LoanStatus
CreditRating
# create base color for plots
base_clolor = sns.color_palette()[0]
# create plot labels
def labels_1 (x,y,t,a):
plt.xlabel(x)
plt.ylabel(y)
plt.title(t)
plt.xticks(rotation = a)
# Univariate visualization for IncomeRange
plot_order = select_df['IncomeRange'].value_counts().index.sort_values(ascending = False)
sns.countplot(data = select_df, x = 'IncomeRange', order = plot_order, color = base_clolor )
labels_1('IncomeRange','counts','Univariate visualization for IncomeRange',90);
#describe income range
select_df['IncomeRange'].describe()
count 97903 unique 7 top $25,000-49,999 freq 29881 Name: IncomeRange, dtype: object
values = ['Not employed','Not displayed']
df = select_df[select_df['IncomeRange'].isin(values)]
df
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 66 | 122819 | 0.16802 | Not employed | 111.0 | 0.20 | Not employed | 1593.0 | 19407.0 | 16.0 | Completed | 0.07 | 4.0 | 629 | Fair | Not Available |
| 226 | 104542 | 0.11749 | Full-time | 0.0 | 10.01 | Not displayed | 152934.0 | 49014.0 | 2866.0 | Completed | 0.61 | 30.0 | 609 | Fair | Not Available |
| 694 | 116795 | 0.11749 | Full-time | 2190.0 | 0.09 | Not displayed | 0.0 | 325.0 | 10.0 | Completed | 0.00 | 6.0 | 749 | Very Good | Not Available |
| 1242 | 146341 | 0.11415 | Not employed | 0.0 | 10.01 | Not employed | 5932.0 | 22118.0 | 165.0 | Completed | 0.21 | 6.0 | 749 | Very Good | Not Available |
| 1925 | 139992 | 0.08684 | Not employed | 0.0 | 10.01 | Not employed | 2169.0 | 5882.0 | 80.0 | Defaulted | 0.10 | 9.0 | 749 | Very Good | Not Available |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113045 | 101305 | 0.09688 | Full-time | 0.0 | 0.17 | Not displayed | 9366.0 | 112734.0 | 286.0 | Completed | 0.07 | 14.0 | 809 | Exceptional | Not Available |
| 113391 | 118741 | 0.19981 | Full-time | 2249.0 | 0.27 | Not displayed | 11193.0 | 1133.0 | 442.0 | Completed | 0.89 | 13.0 | 609 | Fair | Not Available |
| 113474 | 127398 | 0.09567 | Full-time | 4.0 | 0.12 | Not displayed | 12062.0 | 41138.0 | 157.0 | Completed | 0.22 | 15.0 | 789 | Very Good | Not Available |
| 113692 | 250328 | 0.36623 | Not employed | 0.0 | 0.33 | Not employed | 2513.0 | 165.0 | 160.0 | Chargedoff | 1.02 | 15.0 | 589 | Fair | Debt Consolidation |
| 113826 | 135166 | 0.14709 | Retired | 0.0 | 0.11 | Not displayed | 2708.0 | 472.0 | 93.0 | Completed | 0.85 | 11.0 | 709 | Good | Not Available |
215 rows × 15 columns
#get unique income ranges
df.LoanStatus.unique()
array(['Completed', 'Defaulted', 'Chargedoff'], dtype=object)
The income range of borrowers for 215 loan lists were not correctly documented. They were documented as either 'Not employed' or 'Not displayed' and the loan status are either 'Completed', 'Defaulted', or 'Chargedoff'.
25,000 - 49,999 has the highest income range distribution, this is followed by 50,000 - 74,999 range.
# Univariate visualization for EmploymentStatus
plot_order = select_df['EmploymentStatus'].value_counts().index.sort_values()
sns.countplot(data = select_df, x = 'EmploymentStatus', order = plot_order, color = base_clolor )
labels_1('EmploymentStatus','Counts','Univariate visualization for EmploymentStatus',90)
select_df['EmploymentStatus'].describe()
count 97903 unique 7 top Employed freq 65896 Name: EmploymentStatus, dtype: object
select_df.head()
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 0.16516 | Self-employed | 472.0 | 0.17 | $25,000-49,999 | 0.0 | 1500.0 | 24.0 | Completed | 0.00 | 5.0 | 649 | Fair | Not Available |
| 1 | 1209647 | 0.12016 | Employed | 0.0 | 0.18 | $50,000-74,999 | 3989.0 | 10266.0 | 389.0 | Current | 0.21 | 14.0 | 689 | Good | Home Improvement |
| 3 | 658116 | 0.12528 | Employed | 10056.0 | 0.15 | $25,000-49,999 | 1444.0 | 30754.0 | 115.0 | Current | 0.04 | 5.0 | 809 | Exceptional | Motorcycle |
| 4 | 909464 | 0.24614 | Employed | 0.0 | 0.26 | $100,000+ | 6193.0 | 695.0 | 220.0 | Current | 0.81 | 19.0 | 689 | Good | Home Improvement |
| 5 | 1074836 | 0.15425 | Employed | 0.0 | 0.36 | $100,000+ | 62999.0 | 86509.0 | 1410.0 | Current | 0.39 | 21.0 | 749 | Very Good | Debt Consolidation |
# Transforming The employment status column
conditions = [
(select_df['EmploymentStatus'] == 'Employed')|(select_df['EmploymentStatus'] == 'Full-time')|(select_df['EmploymentStatus'] == 'Part-time')|
(select_df['EmploymentStatus'] == 'Self-employed'),
(select_df['EmploymentStatus'] == 'Not employed'),
(select_df['EmploymentStatus'] == 'Other'),
(select_df['EmploymentStatus'] == 'Retired')
]
values = ['Employed', 'Not employed','Other','Retired']
select_df['EmploymentStatus'] = np.select(conditions, values, default = 'NA' )
select_df.head()
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 0.16516 | Employed | 472.0 | 0.17 | $25,000-49,999 | 0.0 | 1500.0 | 24.0 | Completed | 0.00 | 5.0 | 649 | Fair | Not Available |
| 1 | 1209647 | 0.12016 | Employed | 0.0 | 0.18 | $50,000-74,999 | 3989.0 | 10266.0 | 389.0 | Current | 0.21 | 14.0 | 689 | Good | Home Improvement |
| 3 | 658116 | 0.12528 | Employed | 10056.0 | 0.15 | $25,000-49,999 | 1444.0 | 30754.0 | 115.0 | Current | 0.04 | 5.0 | 809 | Exceptional | Motorcycle |
| 4 | 909464 | 0.24614 | Employed | 0.0 | 0.26 | $100,000+ | 6193.0 | 695.0 | 220.0 | Current | 0.81 | 19.0 | 689 | Good | Home Improvement |
| 5 | 1074836 | 0.15425 | Employed | 0.0 | 0.36 | $100,000+ | 62999.0 | 86509.0 | 1410.0 | Current | 0.39 | 21.0 | 749 | Very Good | Debt Consolidation |
# Univariate visualization for the now amended EmploymentStatus
plot_order = select_df['EmploymentStatus'].value_counts().index.sort_values()
sns.countplot(data = select_df, x = 'EmploymentStatus', order = plot_order, color = base_clolor )
labels_1('EmploymentStatus','Counts','Univariate visualization for the now amended EmploymentStatus',90);
Majority of the customers in the dataset are employed (Employed, Full-time, Self-employed, and part-time)
The count of customers whose employment status are either 'Not available' or documented as 'Others' is quite significant
There are few borrowers who are either 'Not employed' or 'Retired'
# Univariate visualization for LoanStatus
plot_order = select_df['LoanStatus'].value_counts().index.sort_values()
sns.countplot(data = select_df, x = 'LoanStatus', order = plot_order, color = base_clolor )
labels_1('LoanStatus','Counts','Univariate visualization for LoanStatus',90);
select_df['LoanStatus'].describe()
count 97903 unique 12 top Current freq 52478 Name: LoanStatus, dtype: object
A larger count of the loans are 'Current' this is followed by the loans that have 'Completed' their tenor
The count of 'Cancelled' loans is very insignificant
There are significant counts of 'Chargedoff' and delinquent ('Defaulted and 'Past Due') loans probably due to poor performance
Past Due loans were isolated by number of days past due
select_df['LoanStatus'].unique()
array(['Completed', 'Current', 'Past Due (1-15 days)', 'Defaulted',
'Chargedoff', 'Past Due (16-30 days)', 'Past Due (61-90 days)',
'Past Due (31-60 days)', 'Past Due (91-120 days)',
'FinalPaymentInProgress', 'Past Due (>120 days)', 'Cancelled'],
dtype=object)
# Transforming The loan status column
conditions = [
(select_df['LoanStatus'] == 'Past Due (1-15 days)')|(select_df['LoanStatus'] == 'Defaulted')|(select_df['LoanStatus'] == 'Past Due (16-30 days)')|
(select_df['LoanStatus'] == 'Past Due (31-60 days)')|(select_df['LoanStatus'] == 'Past Due (61-90 days)')|(select_df['LoanStatus'] == 'Past Due (91-120 days)')|(select_df['LoanStatus'] == 'Past Due (>120 days)'),
(select_df['LoanStatus'] == 'Current')|(select_df['LoanStatus'] == 'FinalPaymentInProgress'),
(select_df['LoanStatus'] == 'Completed'),
(select_df['LoanStatus'] == 'Chargedoff'),
(select_df['LoanStatus'] == 'Cancelled')
]
values = ['Delinquent','Current','Completed','Chargedoff','Cancelled']
select_df['LoanStatus'] = np.select(conditions, values, default=select_df['LoanStatus'])
select_df.head()
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 0.16516 | Employed | 472.0 | 0.17 | $25,000-49,999 | 0.0 | 1500.0 | 24.0 | Completed | 0.00 | 5.0 | 649 | Fair | Not Available |
| 1 | 1209647 | 0.12016 | Employed | 0.0 | 0.18 | $50,000-74,999 | 3989.0 | 10266.0 | 389.0 | Current | 0.21 | 14.0 | 689 | Good | Home Improvement |
| 3 | 658116 | 0.12528 | Employed | 10056.0 | 0.15 | $25,000-49,999 | 1444.0 | 30754.0 | 115.0 | Current | 0.04 | 5.0 | 809 | Exceptional | Motorcycle |
| 4 | 909464 | 0.24614 | Employed | 0.0 | 0.26 | $100,000+ | 6193.0 | 695.0 | 220.0 | Current | 0.81 | 19.0 | 689 | Good | Home Improvement |
| 5 | 1074836 | 0.15425 | Employed | 0.0 | 0.36 | $100,000+ | 62999.0 | 86509.0 | 1410.0 | Current | 0.39 | 21.0 | 749 | Very Good | Debt Consolidation |
# Univariate visualization for the now transformed LoanStatus
plot_order = select_df['LoanStatus'].value_counts().index.sort_values()
sns.countplot(data = select_df, x = 'LoanStatus', order = plot_order, color = base_clolor )
labels_1('LoanStatus','Counts','Univariate visualization for the now transformed LoanStatus',90);
# Univariate plot for credit rating
plot_order = ['Exceptional','Very Good','Good','Fair','Poor']
sns.countplot(data = select_df, x = 'CreditRating', order = plot_order, color = base_clolor )
labels_1('CreditRating','Counts','Univariate plot for credit rating',90);
Larger proportion of the credit rating are good followed by very good
There's an indication that the loan performance will be good
Majority of listed loans are for borrowers with one form of employment or the other (employed, full time, part time, self employed). These employment status column was all transformed to employed. The employment status are now restricted to - employed, Not employed, Retired, and others.
In the loan status column, the past due loans were separated by the number of days past due. The defaulted loans and the loans past due were all transformed to delinquent status. The current loans and loans whose final repayment is in progress were transformed to current status. These were done so that the loan status are restricted to - current, cancelled, charged off, and delinquent.
Feature engineering was performed to get median score and credit ratings from the upper and lower credit scores. FICO credit rating style was applied to grade the credit scores into - poor, fair, good, very good, and exceptional ratings.
Credit scores were given in two different columns (upper and lower credit score). The medians of these scores were taken to for just one column for the analysis.
Loan status and employment status columns were adjusted to restrict or redefine the distribution of variables.
select_df.sample(2)
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48397 | 1119513 | 0.33204 | Employed | 209.0 | 0.44 | $25,000-49,999 | 31405.0 | 7014.0 | 975.0 | Current | 0.78 | 20.0 | 649 | Fair | Debt Consolidation |
| 92353 | 1220658 | 0.23477 | Employed | 0.0 | 0.28 | $25,000-49,999 | 3297.0 | 227.0 | 117.0 | Current | 0.91 | 5.0 | 669 | Fair | Debt Consolidation |
select_df.corr()
| BorrowerAPR | AmountDelinquent | DebtToIncomeRatio | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | BankcardUtilization | CurrentCreditLines | MedianScore | |
|---|---|---|---|---|---|---|---|---|---|
| BorrowerAPR | 1.000000 | 0.068852 | 0.058595 | -0.059577 | -0.361530 | -0.043975 | 0.275235 | -0.080149 | -0.505440 |
| AmountDelinquent | 0.068852 | 1.000000 | -0.019400 | -0.028938 | -0.019537 | -0.055292 | -0.025508 | -0.075639 | -0.067991 |
| DebtToIncomeRatio | 0.058595 | -0.019400 | 1.000000 | 0.038636 | 0.002064 | 0.085119 | 0.035600 | 0.091878 | -0.021215 |
| RevolvingCreditBalance | -0.059577 | -0.028938 | 0.038636 | 1.000000 | 0.234417 | 0.756197 | 0.243741 | 0.335055 | 0.090241 |
| AvailableBankcardCredit | -0.361530 | -0.019537 | 0.002064 | 0.234417 | 1.000000 | 0.247063 | -0.357938 | 0.317871 | 0.459706 |
| OpenRevolvingMonthlyPayment | -0.043975 | -0.055292 | 0.085119 | 0.756197 | 0.247063 | 1.000000 | 0.292950 | 0.547182 | 0.104213 |
| BankcardUtilization | 0.275235 | -0.025508 | 0.035600 | 0.243741 | -0.357938 | 0.292950 | 1.000000 | 0.111830 | -0.401831 |
| CurrentCreditLines | -0.080149 | -0.075639 | 0.091878 | 0.335055 | 0.317871 | 0.547182 | 0.111830 | 1.000000 | 0.147297 |
| MedianScore | -0.505440 | -0.067991 | -0.021215 | 0.090241 | 0.459706 | 0.104213 | -0.401831 | 0.147297 | 1.000000 |
# Investigate relationship between Annual percentage rate (APR) and credit score (Median score)
sns.relplot(data=select_df, x='BorrowerAPR', y='MedianScore')
plt.title('Relationship between Annual percentage rate (APR) and credit score (Median score)');
The median credit score and the borrower's annualpercentage rate (APR) are negatively correlated as seen from the plot
# Relationship between loan status and amount delinquent
plt.figure(figsize = [12, 12])
sns.violinplot(data=select_df, x='LoanStatus',y='AmountDelinquent')
plt.title('Relationship between loan status and amount delinquent');
The little amount of delinquent loans are majorly in Delinquent and charged off groups
# Plot the relationship between income range and median credit score
sns.boxplot(data=select_df,x='IncomeRange',y='MedianScore')
plt.title('Relationship between income range and median credit score')
plt.xticks(rotation = 90);
Outliers were observed while comparing income range and median credit score
There's a wide distribution of median credit scores in group of income range not displayed
# Plot the relationship between revolving credit balance and open revolving monthly payment
sns.regplot(data=select_df, x='RevolvingCreditBalance',y='OpenRevolvingMonthlyPayment',
x_jitter=0.3,y_jitter=0.3, scatter_kws={'alpha':1/40})
plt.title('Relationship between revolving credit balance and open revolving monthly payment');
The correlation between revolving credit balance and open revolving monthly payment is strong positive
Distribution is jampacked along the lower end of the plot
# Plot the relationship between revolving credit balance and Current Credit Lines
sns.regplot(data=select_df, x='RevolvingCreditBalance',y='CurrentCreditLines',
x_jitter=0.3,y_jitter=0.3, scatter_kws={'alpha':1/40})
plt.title('Relationship between revolving credit balance and Current Credit Lines');
The correlation between revolving credit balance and open revolving monthly payment is positive
Distribution is jampacked along the lower end of the plot
# Plot the relationship between Available Bankcard Credit and Bank card Utilization
sns.regplot(data=select_df, x='AvailableBankcardCredit',y='BankcardUtilization',
x_jitter=0.3,y_jitter=0.3, scatter_kws={'alpha':1/40})
plt.title('Relationship between Available Bankcard Credit and Bank card Utilization');
The plot shows that correlation is negative
There's a concentration of points at the lower end of the plot
# compare two qualitative variables (EmploymentStatus and CreditRating) clustered bar chart
sns.countplot(data=select_df, x='EmploymentStatus',hue='CreditRating')
plt.title('Comparison of Employment Status and Credit Rating');
The employed category of borrowers have the highest credit ratings
The highest credit rating is 'Good'
# compare two qualitative variables (ListingCategory and EmploymentStatus) clustered bar plot
sns.countplot(data=select_df, y='ListingCategory',hue='EmploymentStatus')
plt.title('Comparison of Listing Category and EmploymentStatus');
Majority of the employed borrowers have their debts consolidated
# checking for listing category and delinquent amount
x= sns.FacetGrid(data=select_df, row='ListingCategory')
x.map(plt.hist, 'AmountDelinquent')
x.fig.suptitle('Facetgrid plot for ListingCategory and AmountDelinquent', y=1.008);
The delinquent amounts are majorly concentarated in debt consolidation listing category
Having done some inexhaustive bivariate explorations on the chosen dataset, some sriking relationships were observed;
The delinquent amounts are majorly concentarated in debt consolidation listing category
Majority of the employed borrowers borrowed to consolidate previous debts making the general credit performance look good
The 'Good' rating according to FICO SCORE is the majority and the employed category of borrowers have it.
The little amount of delinquent loans are majorly in Delinquent and charged off groups of the wrngled loan status
There's a wide distribution of median credit scores in group of income range that were not displayed
Yes! Majority of the employed borrowers have their debts consolidated
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
select_df.sample(2)
| ListingNumber | BorrowerAPR | EmploymentStatus | AmountDelinquent | DebtToIncomeRatio | IncomeRange | RevolvingCreditBalance | AvailableBankcardCredit | OpenRevolvingMonthlyPayment | LoanStatus | BankcardUtilization | CurrentCreditLines | MedianScore | CreditRating | ListingCategory | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64872 | 607062 | 0.35797 | Employed | 0.0 | 0.33 | $50,000-74,999 | 13178.0 | 691.0 | 190.0 | Chargedoff | 0.92 | 7.0 | 669 | Fair | Home Improvement |
| 85255 | 683220 | 0.17754 | Employed | 0.0 | 0.47 | $50,000-74,999 | 110403.0 | 47034.0 | 1626.0 | Completed | 0.68 | 27.0 | 689 | Good | Debt Consolidation |
num_variables = ['BorrowerAPR','BankcardUtilization','MedianScore','AmountDelinquent','DebtToIncomeRatio',
'RevolvingCreditBalance','BankcardUtilization','OpenRevolvingMonthlyPayment','CurrentCreditLines']
cat_variables = ['EmploymentStatus','IncomeRange','LoanStatus','CreditRating','ListingCategory']
# Investigation visually the correlation of numerical variables with heatmap
sns.heatmap(select_df.corr(), annot = True, fmt = '.2f', cmap = 'vlag_r', center = 0)
plt.title('Heatmap correlation between numerical variables');
Amount delinquent has weak negative correlation with other numerical variable except Borrower APR wchich it has weak positive correlation with.
The correlation between revolving credit balance and open revolving monthly repayment is strong positive.
There stands to be no correlation at all between debt to income ratio and available bank credit card.
median credit score and borrower APR are strongly negatively correlated
Availabile bank credit card seems to be mostly positively correlated with median credit score.
# What is Relationship between CreditRating, ListingCategory, LoanStatus like
p = sns.catplot(data = select_df,x = 'CreditRating', hue = 'LoanStatus', col = 'ListingCategory',
kind = 'count', palette = 'Blues', col_wrap = 3)
p.fig.suptitle('Relationship b/w CreditRating, ListingCategory, LoanStatus', y=1.0);
It is confirmed that majority of the credits currently running are for debt consolidation
A significant count of the credit items are charged off
# comparing ListingCategory, LoanStatus, BorrowerAPR, and AmountDelinquent
x = sns.FacetGrid(data = select_df, col = 'LoanStatus', row ='ListingCategory',
margin_titles = True)
x.map(plt.scatter, 'AmountDelinquent', 'BorrowerAPR')
x.fig.suptitle('Relationship Facet grid for numerical and categorical variables', y=1.008)
x.set_xticklabels(rotation=90);
The very weak positive correlation between borrower APR and amount delinquent variables, is more pronounced on debt consolidation and others listing category with loan status as completed, current, delinquent, and charged off in that order.
# plot the relationship bwtween numerical variables
g = sns.PairGrid(data = select_df, vars = num_variables)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.fig.suptitle('Scatter plot and histogram correlation grid for numerical variables', y=1.08);
Amount delinquent has weak negative correlation with other numerical variable except Borrower APR wchich it has weak positive correlation with.
The correlation between revolving credit balance and open revolving monthly repayment is strong positive.
There stands to be no correlation at all between debt to income ratio and available bank credit card.
median credit score and borrower APR are strongly negatively correlated
Availabile bank credit card seems to be mostly positively correlated with median credit score.
# Comparing categorical and numerical variables in various facets
t = sns.PairGrid(data = df, x_vars = num_variables,
y_vars = cat_variables)
t.map(sns.violinplot, color=base_clolor)
t.fig.suptitle('Relationship grid for categorical & numerical variables', y=1.08);
More examination is needed for proper communication of visualization 5 as some variables seems to be missing
# Compare RevolvingCreditBalance, OpenRevolvingMonthlyPayment, and AmountDelinquent
plt.figure(figsize = [10,8])
plt.scatter(data = select_df, x = 'RevolvingCreditBalance', y = 'OpenRevolvingMonthlyPayment',
c = 'AmountDelinquent')
plt.colorbar(label = 'AmountDelinquent ($)')
labels_1('RevolvingCreditBalance ($)', 'OpenRevolvingMonthlyPayment ($)',
'Relationship b/w RevolvingCreditBalance, OpenRevolvingMonthlyPayment & AmountDelinquent',0)
RevolvingCreditBalance and OpenRevolvingMonthlyPayment have a very strong positive correlation
Amounts Delinquent are more concetrated in the lower range, say 0 to 100,000 and sparsely distributed as we go up the legend
# Compare LoanStatus, AmountDelinquent, and EmploymentStatus
plt.figure(figsize = [12, 12])
sns.violinplot(data=select_df, x='LoanStatus', y='AmountDelinquent', hue='EmploymentStatus')
labels_1('LoanStatus','AmountDelinquent ($)','Relationship b/w LoanStatus, AmountDelinquent & EmploymentStatus',90)
The amount delinquent accross all employment status have majorly charged off and completed loan status in that order
The credit amount delinquent with current loan status are concentrated on others and retired employment status
The amount delinquent with Cancelled and Delinquent loan status are very insignificant
# Compare CreditRating, DebtToIncomeRatio, and IncomeRange variables
plt.figure(figsize = [12, 10])
sns.boxplot(data=select_df, x='CreditRating', y='DebtToIncomeRatio', hue='IncomeRange')
plt.title('Relationship b/w CreditRating, DebtToIncomeRatio & IncomeRange');
Too many outliers exist in the plot
The income ranges improperly documented as 'Not employed' and 'Not displayed' have a very significant effect on this analysis
The income ranges improperly documented as 'Not employed' and 'Not displayed' have a very significant effect on this analysis.
The amount delinquent accross all employment status have majorly charged off and completed loan status in that order. The credit amount delinquent with current loan status are concentrated on others and retired employment status. The amount delinquent with Cancelled and Delinquent loan status are very insignificant.
Amounts Delinquent are more concetrated in the lower ranges of the credit, compared to the whole dataset, the delinquncy is not significant as most of the debts have been consolidated, completed or charged off.
Yes! There are interesting interaction between features for e.g;
Borrower APR is the only variable that slightly positively influence amounts delinquent
The income ranges improperly documented as 'Not employed' and 'Not displayed' have a very significant effect on this analysis and these needs to be resolved.
You can write a summary of the main findings and reflect on the steps taken during the data exploration.
113,937 rows and 81 columns of prosper loan dataset was worked on, data wrangling and feature engineering reduced the dataset to the shape (96187, 16) and exploration was done on the resultant dataset.
Univariate explorations were carried out to investigate how distrubuted the resultant loan listing dataset is accross, IncomeRange, EmploymentStatus, LoanStatus, and CreditRating. The behaviour of the dataset was established and feature engineering was done on some variables, loan status, employment status, upper and lower credit scores and credit ratings to prepare data for visual analysis.
Bivariate explorations were done to examine the relationship between pairs of variables in the dataset and some interesting relationships between variable pairs were noticed and used for analysis.
Multivariate explorations were done to nvestigate the relationship between more than two variable. These exploratons gave more insightful comfirmation and check of the relationships and distributions noticed in the bivariate and univariate explorations respectively.
In all these analytical visual explorations, attempts were made to investigate the the performance of the listed credits. In all the explorations, the following were found:
The performance of the listed credit were generally good.
Majority of borrowers in previous debts borrowed for loan consolidation.
Improper or incomplete documentation, especially in the borrowers income ranges did not allow for exhaustive analysis
The correlation between numeric variables and amounts amount delinquent variables are generally weakly negative except for BorrowerAPR which positively influenced delinquency.
A more exhaustive research and analysis is recommended to get a more exhaustive and cnclusive result.
The results gotten from this project does not in any way imply an ehaustive conclusion. All the explorations done here are open and recommended for more indept review and research in order to reach probably better conclusions and results. This project is strictly for the purpose of ALX data analysis project.